Clustering and classification (week 4)

Clustering and classification are statistical methods that can be used to investigate the relationships between variables in a data set and to partition the data set into groups. Classification can be used when a training data set is available that contains the class each data point belongs to. Clustering can be used even when this information is not available, by partitioning the data set based on distances between observations.
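The distinction can be sketched with a small R example on the built-in iris data (an illustrative aside using lda() from MASS and kmeans(); it is not part of the Boston analysis below):

```r
library(MASS)

# classification: lda() learns a rule from the known Species labels
iris_lda <- lda(Species ~ ., data = iris)

# clustering: kmeans() partitions the measurements on distances alone,
# without ever seeing the labels
iris_km <- kmeans(iris[, 1:4], centers = 3)

# compare the unsupervised clusters against the true classes
table(cluster = iris_km$cluster, class = iris$Species)
```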

We will investigate crime rates in suburbs of Boston using the Boston data set from the MASS library [1]. This data set has 506 rows and 14 variables, containing various statistics from suburbs of Boston, such as land zoning, location, housing values and population demographics. The variables are (roughly described) crim (crime rate per capita), zn (proportion of residential land zoned for large lots), indus (proportion of industrial land), chas (close to Charles River), nox (nitrogen oxides concentration), rm (average number of rooms per home), age (proportion of old homes), dis (distance to Boston employment centres), rad (accessibility of radial highways), tax (property tax rate), ptratio (pupil-teacher ratio), black (a transformed proportion of black residents), lstat (proportion of lower status), medv (median value of homes). The exact definitions of these variables are given in the MASS library documentation. We are interested in how the crime rate in each suburb is related to the other variables. Using classification, we want to predict the crime rate in each suburb from the other variables. Using clustering, we want to find out whether the data set can be partitioned into meaningful groups based on the variables.

library(MASS)

# load Boston data set from MASS library
data("Boston")

# explore structure of Boston data set
# 506 rows, 14 variables
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
boston_summary <- summary(Boston)

# create summary table for data set
summary_table <- function(data, cap) {
  data_summary <- as.data.frame(apply(data, 2, summary))
  kable(data_summary, caption = cap, digits=2) %>%
  kable_styling(font_size = 12)
}

# make summary of Boston data set
summary_table(Boston, "Unscaled Boston data set summary.")
Unscaled Boston data set summary.

|         | crim  | zn     | indus | chas | nox  | rm   | age    | dis   | rad   | tax    | ptratio | black  | lstat | medv  |
|---------|-------|--------|-------|------|------|------|--------|-------|-------|--------|---------|--------|-------|-------|
| Min.    | 0.01  | 0.00   | 0.46  | 0.00 | 0.38 | 3.56 | 2.90   | 1.13  | 1.00  | 187.00 | 12.60   | 0.32   | 1.73  | 5.00  |
| 1st Qu. | 0.08  | 0.00   | 5.19  | 0.00 | 0.45 | 5.89 | 45.02  | 2.10  | 4.00  | 279.00 | 17.40   | 375.38 | 6.95  | 17.02 |
| Median  | 0.26  | 0.00   | 9.69  | 0.00 | 0.54 | 6.21 | 77.50  | 3.21  | 5.00  | 330.00 | 19.05   | 391.44 | 11.36 | 21.20 |
| Mean    | 3.61  | 11.36  | 11.14 | 0.07 | 0.55 | 6.28 | 68.57  | 3.80  | 9.55  | 408.24 | 18.46   | 356.67 | 12.65 | 22.53 |
| 3rd Qu. | 3.68  | 12.50  | 18.10 | 0.00 | 0.62 | 6.62 | 94.07  | 5.19  | 24.00 | 666.00 | 20.20   | 396.22 | 16.96 | 25.00 |
| Max.    | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00   | 396.90 | 37.97 | 50.00 |

A summary of the variables in the Boston data set is given in table @ref(tab:sm4), and histograms of the variables are plotted in figure @ref(fig:hi4). The variables have very different means and standard deviations, depending on how they are defined. The median crime rate is 0.26, while the mean crime rate is 3.61, which suggests that most areas have low crime rates while a few areas have very high crime rates. The residential and industrial land variables zn and indus have approximately the same mean of about 11%. It's notable that the median of zn is 0%, which could mean that many suburbs have no residential land zoned for large lots. The variable chas is a dummy variable: 1 for suburbs bounding the Charles River and 0 otherwise.

The mean nitrogen oxides concentration nox is 0.55 parts per 10 million, and its distribution is fairly even. The average number of rooms per home is about 6, and most of the owner-occupied homes were built prior to 1940. The mean distance to employment centres dis is 3.80 and the mean accessibility index of radial highways rad is 9.55; many suburbs are close to employment centres and have good access to radial highways, while many others have poor access. The property tax rate varies widely between suburbs, with a mean of 408. The mean pupil-teacher ratio is 18, with some variation between suburbs. The variable black (a transformed proportion of black residents) is strongly skewed: its minimum is 0.32, while its 1st quartile is already 375.38, so most suburbs score high on this index and only a few score very low. There are many suburbs with a larger low-status population, while fewer suburbs have a smaller one. The median value of homes has a mean of roughly $22,500, ranging from $5,000 to $50,000.

# plot histogram from vector
vhist <- function(var, xlab) ggplot() + aes(var) + 
  geom_histogram() + xlab(xlab) + ylab("Count")

# plot histograms for all variables in boston data set
boston_hist <- function(data) {
  h1 <- vhist(data$crim, "Crime per capita")
  h2 <- vhist(data$zn, "Proportion of residential land")
  h3 <- vhist(data$indus, "Proportion of industrial land")
  h4 <- vhist(data$chas, "Bounds river")
  h5 <- vhist(data$nox, "Nitrogen oxides concentration")
  h6 <- vhist(data$rm, "Rooms per dwelling")
  h7 <- vhist(data$age, "Proportion of old homes")
  
  h8 <- vhist(data$dis, "Distance to employment centres")
  h9 <- vhist(data$rad, "Accessibility to radial highways")
  h10 <- vhist(data$tax, "Property tax rate")
  h11 <- vhist(data$ptratio, "Pupil to teacher ratio")
  h12 <- vhist(data$black, "Proportion of black people")
  h13 <- vhist(data$lstat, "Lower status of population")
  h14 <- vhist(data$medv, "Value of homes")
  
  grid.arrange(h1, h2, h3, h4, h5, h6, h7, 
               h8, h9, h10, h11, h12, h13, h14, 
               ncol=7, nrow=2)
}

# plot histogram of variables
boston_hist(Boston)
Histograms of variables in unscaled Boston data set.\label{fig:hi4}

We can examine how the variables in the Boston data set are related to each other by plotting the correlation matrix. In figure @ref(fig:cm4), the correlation coefficients between the variables are listed (lower half) and visualized using circles (upper half). The statistical significance level is shown inside each circle using stars (*** 0.001, ** 0.01, * 0.05). According to the figure, many of the variables in the Boston data set are highly correlated with each other. The variables most significantly correlated with the crime rate are rad, tax and lstat: accessibility to radial highways and the property tax rate are associated with higher crime rates, and the proportion of low-status population also plays a significant role. This could mean that the crime rate is higher in central parts of the city, which also have a larger low-status population.

# calculate correlation coefficient matrix for boston data set
boston_cor <- cor(Boston) %>% round(digits = 2)

# calculate 95% confidence intervals and p-values
boston_conf <- cor.mtest(Boston, conf.level = .95)
boston_p <- boston_conf$p
# set lower-triangle p-values to 1 so significance stars are drawn only in the upper half
boston_p[lower.tri(boston_p, diag = T)] <- 1

# create correlation matrix with coefficients and p-values
corrplot.mixed(boston_cor, 
         p.mat = boston_p,
         upper = "circle",
         lower = "number",
         insig = "label_sig",
         pch.col = "black",
         pch.cex = 1.0,
         lower.col = "black",
         sig.level = c(.001, .01, .05))
Correlation matrix of Boston data set with correlation coefficients and $p$-values.\label{fig:cm4}

We standardize the Boston data set by shifting the mean of each variable \(\mu\) to 0 and scaling the standard deviation \(\sigma\) to 1. Each variable \(x\) is standardized as \(x' = \frac{x - \mu}{\sigma}\) using the function scale(). The summary of the scaled Boston data set is shown in table @ref(tab:sms4) and histograms of the variables are shown in figure @ref(fig:his4). We can see that the mean of each variable has been shifted to 0 and the distributions have been scaled to a more similar range. Since the shapes of the distributions are unchanged, the ranges are not identical, but every variable now has a standard deviation of 1.
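The effect of scale() can be verified by hand on a single variable; a quick sanity check, not part of the report's pipeline:

```r
library(MASS)
data("Boston")

# standardize one variable manually: x' = (x - mu) / sigma
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# scale() applies the same transformation column-wise
crim_scaled <- as.numeric(scale(Boston)[, "crim"])

all.equal(crim_scaled, crim_manual)  # TRUE
```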

We create a categorical (factor) variable crime from crim by cutting the crime rate distribution at its quartiles. This results in 4 classes of crime rate: low, med_low, med_high and high. The old crime rate variable crim is removed from the data set.

We split the scaled Boston data set into two data sets: train and test. The training data set is created by randomly sampling 80% of the rows of the scaled Boston data set; the remaining 20% form the test data set. The target variable crime is removed from the test data set and saved separately for evaluating predictions.

# normalize boston data set by scaling mean and standard deviation
boston_scaled <- as.data.frame(scale(Boston))

# check normalized data set
boston_scaled_summary <- summary(boston_scaled)

# make summary of scaled Boston data set
summary_table(boston_scaled, "Scaled Boston data set summary.")
Scaled Boston data set summary.

|         | crim  | zn    | indus | chas  | nox   | rm    | age   | dis   | rad   | tax   | ptratio | black | lstat | medv  |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------|-------|-------|-------|
| Min.    | -0.42 | -0.49 | -1.56 | -0.27 | -1.46 | -3.88 | -2.33 | -1.27 | -0.98 | -1.31 | -2.70   | -3.90 | -1.53 | -1.91 |
| 1st Qu. | -0.41 | -0.49 | -0.87 | -0.27 | -0.91 | -0.57 | -0.84 | -0.80 | -0.64 | -0.77 | -0.49   | 0.20  | -0.80 | -0.60 |
| Median  | -0.39 | -0.49 | -0.21 | -0.27 | -0.14 | -0.11 | 0.32  | -0.28 | -0.52 | -0.46 | 0.27    | 0.38  | -0.18 | -0.14 |
| Mean    | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00  | 0.00    | 0.00  | 0.00  | 0.00  |
| 3rd Qu. | 0.01  | 0.05  | 1.01  | -0.27 | 0.60  | 0.48  | 0.91  | 0.66  | 1.66  | 1.53  | 0.81    | 0.43  | 0.60  | 0.27  |
| Max.    | 9.92  | 3.80  | 2.42  | 3.66  | 2.73  | 3.55  | 1.12  | 3.96  | 1.66  | 1.80  | 1.64    | 0.44  | 3.55  | 2.99  |
set.seed(6344) # get rid of randomness

boston_hist(boston_scaled)
Histograms of variables in scaled Boston data set.\label{fig:his4}

# categorize crime into 4 categories low, med_low, med_high and high by quantile
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), 
              include.lowest = TRUE, 
             labels = c("low", "med_low", "med_high", "high"))
# remove old crim variable from scaled data set
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add new categorical variable
boston_scaled <- data.frame(boston_scaled, crime)

# randomly select rows from data in ratio 4/5
boston_n <- nrow(boston_scaled)
boston_ind <- sample(boston_n,  size = boston_n * 0.8)

# create training data set by selecting 4/5 from original data set
boston_train <- boston_scaled[boston_ind,]

# create test data set by selecting 1/5 from original data set
boston_test <- boston_scaled[-boston_ind,]
# save crime target variable from test data set
boston_test_crime <- boston_test$crime
# remove the target variable from test
boston_test <- dplyr::select(boston_test, -crime)

We fit a linear discriminant analysis (LDA) model to the training data set using the function lda(). A function plot_lda() is written using ggplot2 to visualize the LDA model. This LDA biplot is shown in figure @ref(fig:lda4). The different crime classes are visualized using colors, together with normal data ellipses. The discriminant coefficients of the variables are visualized using arrows, with the length (and opacity) of each arrow calculated as \(l = \sqrt{\text{LD1}^2+\text{LD2}^2}\). We can see that the variable rad is the most significant separator between the classes, followed by zn and nox. It looks like the suburbs are clearly separated into two groups: those with high crime and those with less crime.

# draw lda biplot
plot_lda <- function(fit, data) {
  # get LD1, LD2 from fit
  lda_predict = predict(fit)$x
  predict_data = data.frame(
    LD1 = lda_predict[,1],
    LD2 = lda_predict[,2],
    Crime = data$crime)
  
  # calculate data for biplot arrows (length, angle)
  lda_data <- data.frame(var=rownames(coef(fit)), coef(fit))
  lda_data$length <- with(lda_data, sqrt(LD1^2 + LD2^2))
  lda_data$angle <- atan2(lda_data$LD2, lda_data$LD1)
  # text positions
  lda_data$x_end <- cos(lda_data$angle) * lda_data$length
  lda_data$y_end <- sin(lda_data$angle) * lda_data$length
  
  # plot points, normal data ellipses and arrows
  ggplot(data=predict_data, aes(x=LD1,y=LD2,col=Crime)) + geom_point() + 
    stat_ellipse(aes(fill = Crime), geom = "polygon", alpha = .3) +
    geom_spoke(aes(0, 0, angle = angle, alpha = length, radius = length), 
               lda_data, color = "red", size = 0.5, 
               arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(aes(y = y_end, x = x_end, label = var, alpha = length),
            lda_data, size = 4, vjust = .5, hjust = 0, color = "red")
}

# fit linear discriminant analysis on train data set
boston_lda <- lda(crime ~ ., data = boston_train)

# calculate LDA length
lda_data2 <- data.frame(var=rownames(coef(boston_lda)), coef(boston_lda))
lda_data2$length <- with(lda_data2, sqrt(LD1^2 + LD2^2))

# make LDA biplot
plot_lda(boston_lda, boston_train)
LDA biplot for Boston training set.\label{fig:lda4}

We predict the crime classes using the LDA model with the test data set and compare the predictions to the correct crime classes. These predictions are cross-tabulated in table @ref(tab:ct4). The LDA model prediction accuracy is 57% for the low class, 80% for med_low, 53% for med_high and 96% for high. The model misclassifies some low points as med_low and some med_high points as med_low. The total error rate of the model is 30%.

# predict values from LDA using test data set
boston_predict_test <- predict(boston_lda, newdata = boston_test)

# cross-tabulation for predicted crime class
cross_tab <- table(correct = boston_test_crime, 
                    predict = boston_predict_test$class)
rownames(cross_tab) <- paste(rownames(cross_tab), ".correct", sep="")
colnames(cross_tab) <- paste(colnames(cross_tab), ".predict", sep="")

kable(cross_tab, caption = "Cross-tabulation for predicted crime class from LDA.") %>% 
 kable_styling()
Cross-tabulation for predicted crime class from LDA.

|                  | low.predict | med_low.predict | med_high.predict | high.predict |
|------------------|-------------|-----------------|------------------|--------------|
| low.correct      | 13          | 9               | 1                | 0            |
| med_low.correct  | 2           | 12              | 1                | 0            |
| med_high.correct | 0           | 17              | 19               | 0            |
| high.correct     | 0           | 0               | 1                | 27           |
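The per-class accuracies and the total error rate quoted above follow directly from the cross-tabulation; as a check, using the counts from the table:

```r
# counts from the cross-tabulation (rows = correct, cols = predicted)
ct <- matrix(c(13,  9,  1,  0,
                2, 12,  1,  0,
                0, 17, 19,  0,
                0,  0,  1, 27),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct = c("low", "med_low", "med_high", "high"),
                             predict = c("low", "med_low", "med_high", "high")))

# per-class accuracy: correct predictions divided by class size
round(diag(ct) / rowSums(ct), 2)
#  low  med_low  med_high  high
# 0.57     0.80      0.53  0.96

# total error rate: 1 - overall accuracy
round(1 - sum(diag(ct)) / sum(ct), 2)
# 0.3
```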

Distances between observations in the scaled Boston data set are calculated using the dist() function with methods euclidean and manhattan. These methods calculate the distance \(d\) between observations as \(d^2 = \sum_i \Delta x_i^2\) and \(d = \sum_i |\Delta x_i|\) respectively (summing over dimensions \(i\)). The Manhattan distance is the \(L^1\) norm of the difference vector and the Euclidean distance is its \(L^2\) norm. Summaries of these distances are shown in table @ref(tab:smm4). We can see that all the values for the Manhattan distance are larger than for the Euclidean distance, which is expected: the \(L^1\) norm is always at least as large as the \(L^2\) norm.
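The two metrics can be checked on a pair of toy points (a small aside, independent of the Boston data):

```r
# two 2-d points (0,0) and (3,4):
# Euclidean distance sqrt(3^2 + 4^2) = 5, Manhattan distance |3| + |4| = 7
pts <- rbind(c(0, 0), c(3, 4))

as.numeric(dist(pts))                        # 5
as.numeric(dist(pts, method = "manhattan"))  # 7
```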

# reload scaled boston data set
boston_scaled <- as.data.frame(scale(Boston))

# calculate Euclidean distance matrix sqrt(dx^2 + dy^2)
boston_dist <- dist(boston_scaled)
# calculate Manhattan distance matrix |dx| + |dy|
boston_dist_man <- dist(boston_scaled, method = "manhattan")

# plot distances as graphs
# library(qgraph)
# par(mfrow=c(1, 2))
# qgraph(1 / as.matrix(boston_dist), layout='spring', vsize=3)
# title("Euclidean",line=2.5)
# qgraph(1 / as.matrix(boston_dist_man), layout='spring', vsize=3)
# title("Manhattan",line=2.5)
dist_summary <- apply(data.frame(boston_dist), 2, summary)
dist_man_summary <- apply(data.frame(boston_dist_man), 2, summary)
dist_table <- data.frame(Euclidean=unname(dist_summary), 
                 Manhattan=unname(dist_man_summary))
rownames(dist_table) <- row.names(dist_summary)
kable(dist_table, 
      caption = "Summary of Euclidean and Manhattan distances.", digits=2) %>% 
  kable_styling()
Summary of Euclidean and Manhattan distances.

|         | Euclidean | Manhattan |
|---------|-----------|-----------|
| Min.    | 0.13      | 0.27      |
| 1st Qu. | 3.46      | 8.48      |
| Median  | 4.82      | 12.61     |
| Mean    | 4.91      | 13.55     |
| 3rd Qu. | 6.19      | 17.76     |
| Max.    | 14.40     | 48.86     |

We run the \(k\)-means algorithm on the scaled Boston data set, first with \(k=3\). To find the optimal value for \(k\), we calculate the total within-cluster sum of squares (TWCSS) for a range \(k \in [1, 10]\). These are averaged over 10 runs to reduce random variation. The resulting graph is plotted in figure @ref(fig:km4). We can see that the largest reduction in TWCSS occurs at \(k=2\), while for larger \(k\) the differences are less significant. According to the elbow method, the optimal value would likely be \(k=2\).
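What kmeans() reports as tot.withinss is the sum over all points of the squared Euclidean distance to the assigned cluster centre; a quick check of that definition on the iris measurements (an aside, not part of the Boston analysis):

```r
set.seed(1)
x <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = 3)

# TWCSS by hand: squared distance of each point to its cluster centre
twcss_manual <- sum((x - km$centers[km$cluster, ])^2)

all.equal(twcss_manual, km$tot.withinss)  # TRUE
```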

boston_km <- kmeans(boston_scaled, centers = 3)

k_max <- 10 # maximum k
km_n <- 10 # number of k-means twcss to average
twcss <- rep(0, k_max)
for (i in 1:km_n) {
  twcss <- twcss + sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss) / km_n
}

ggplot() + aes(x = 1:k_max, y = twcss) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=1:k_max) +
  xlab("k") + ylab("TWCSS")
Total within-cluster sum of squares for different $k$.\label{fig:km4}

# pick k at the elbow: the step with the largest drop in TWCSS
k_best <- which.max(twcss[1:(k_max-1)] - twcss[2:k_max]) + 1

boston_km <- kmeans(boston_scaled, centers = k_best)

The \(k\)-means clusters are plotted using ggpairs() in figure @ref(fig:pp4). The scatter plots and histograms are colored based on which of the two clusters the data points belong to. We can see that the scatter plots show quite good separation between the groups for most combinations of variables, and the blue and red areas mostly don't overlap. Similarly, the distributions of the variables are mostly different between the clusters, though there are exceptions (the variables chas and rm, for example). It's also notable that the correlation coefficients between variables differ between the two clusters. These observations suggest that the two clusters represent actual separate groups in the data. Based on figure @ref(fig:hc4), we can see that the crime rate is effectively separated by cluster, and all the crime rates in the lower cluster are below the mean. This suggests that the data has been separated into low-crime and high-crime clusters.

ggpairs(boston_scaled, mapping = aes(color=factor(boston_km$cluster), alpha=0.7))
Pair plot of scaled Boston data set colored by $k$-means cluster.\label{fig:pp4}

crime_cluster <- data.frame(crim=boston_scaled$crim, cluster=factor(boston_km$cluster))
ggplot(crime_cluster, aes(x = crim, col = cluster)) + geom_boxplot()
Box plot of crime rate by cluster.\label{fig:hc4}

Bonus

We run \(k\)-means clustering for \(k \in [3, 5]\) on the scaled Boston data set. LDA is then run on the data set with the \(k\)-means cluster as the target variable. The LDA biplots are shown in figure @ref(fig:lk4), colored by cluster. We get perhaps the best separation using \(k=4\). Depending on the number of clusters, different variables are influential. The lengths of the variable vectors are given in table @ref(tab:ll4). For \(k=3\), the most significant variables are rad, tax and age. For \(k=4\), the most significant variables are rad, zn and tax. For \(k=5\), the most significant variables are black, nox and tax. Many of these variables are closely related to location, which seems to be important in determining the crime rate.

set.seed(6344) # get rid of randomness
# reload scaled boston data set
boston_scaled <- as.data.frame(scale(Boston))

# draw lda biplot
plot_lda <- function(fit, data) {
  # get LD1, LD2 from fit
  lda_predict = predict(fit)$x
  predict_data = data.frame(
    LD1 = lda_predict[,1],
    LD2 = lda_predict[,2],
    Cluster = factor(data$cluster))
  
  # calculate data for biplot arrows (length, angle)
  lda_data <- data.frame(var=rownames(coef(fit)), coef(fit))
  lda_data$length <- with(lda_data, sqrt(LD1^2 + LD2^2))
  lda_data$angle <- atan2(lda_data$LD2, lda_data$LD1)
  # text positions
  lda_data$x_end <- cos(lda_data$angle) * lda_data$length
  lda_data$y_end <- sin(lda_data$angle) * lda_data$length
  
  # plot points, normal data ellipses and arrows
  ggplot(data=predict_data, aes(x=LD1,y=LD2,col=Cluster)) + geom_point() + 
    stat_ellipse(aes(fill = Cluster), geom = "polygon", alpha = .3) +
    geom_spoke(aes(0, 0, angle = angle, alpha = length, radius = length), 
               lda_data, color = "red", size = 0.5, 
               arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(aes(y = y_end, x = x_end, label = var, alpha = length),
            lda_data, size = 4, vjust = .5, hjust = 0, color = "red")
}

boston_km3 <- kmeans(boston_scaled, centers = 3)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda3 <- lda(boston_km3$cluster ~ ., data = boston_scaled)
lda3 <- plot_lda(boston_lda3, boston_km3)
lda_data3 <- data.frame(var=rownames(coef(boston_lda3)), coef(boston_lda3))
lda_data3$length <- with(lda_data3, sqrt(LD1^2 + LD2^2))

boston_km4 <- kmeans(boston_scaled, centers = 4)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda4 <- lda(boston_km4$cluster ~ ., data = boston_scaled)
lda4 <- plot_lda(boston_lda4, boston_km4)
lda_data4 <- data.frame(var=rownames(coef(boston_lda4)), coef(boston_lda4))
lda_data4$length <- with(lda_data4, sqrt(LD1^2 + LD2^2))

boston_km5 <- kmeans(boston_scaled, centers = 5)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda5 <- lda(boston_km5$cluster ~ ., data = boston_scaled)
lda5 <- plot_lda(boston_lda5, boston_km5)
lda_data5 <- data.frame(var=rownames(coef(boston_lda5)), coef(boston_lda5))
lda_data5$length <- with(lda_data5, sqrt(LD1^2 + LD2^2))

grid.arrange(lda3, lda4, lda5, ncol=3, nrow=1)
LDA biplots for $k$-means clusters.\label{fig:lk4}

llen_table <- data.frame(LDA3Length=lda_data3$length, 
                         LDA4Length=lda_data4$length,
                         LDA5Length=lda_data5$length)
rownames(llen_table) <- lda_data3$var

kable(llen_table, 
      caption = "LDA lengths for different variables and different $k$.", digits=2) %>% 
  kable_styling()
LDA lengths for different variables and different \(k\).

|         | LDA3Length | LDA4Length | LDA5Length |
|---------|------------|------------|------------|
| crim    | 0.21       | 0.19       | 0.71       |
| zn      | 0.36       | 1.51       | 0.17       |
| indus   | 0.34       | 0.42       | 0.73       |
| chas    | 0.14       | 0.12       | 0.02       |
| nox     | 0.20       | 0.04       | 0.94       |
| rm      | 0.46       | 0.18       | 0.48       |
| age     | 0.86       | 0.60       | 0.22       |
| dis     | 0.63       | 0.54       | 0.10       |
| rad     | 2.05       | 6.14       | 0.53       |
| tax     | 1.23       | 0.64       | 0.89       |
| ptratio | 0.13       | 0.31       | 0.34       |
| black   | 0.17       | 0.04       | 2.35       |
| lstat   | 0.20       | 0.14       | 0.28       |
| medv    | 0.37       | 0.15       | 0.15       |

Super-Bonus

We plot the LDA projection of the Boston training data set in 3D using the matrix product of the predictors with the LDA scaling matrix, visualized with the function plot_ly(). For the LDA plot, the color is the crime class, while the \(k\)-means plot is colored by cluster. We choose \(k=4\) for plotting.

We can see that the LDA plot shows a similar shape as before, except now the third dimension LD3 is also shown. We see good separation between the classes, especially between the high class and the others. The \(k\)-means plot shows a fairly similar result: one cluster is clearly distinct from the rest, though in the LDA plot this separation is perhaps cleaner, while the remaining \(k\)-means clusters are more spatially distinct from each other. The LDA classes overlap more than the \(k\)-means clusters, which makes sense, as the clustering is based purely on distance. The \(k\)-means clustering appears to approximate the crime classification reasonably well.

set.seed(1919) # get rid of randomness
model_predictors <- dplyr::select(boston_train, -crime)
boston_train_km4 <- kmeans(model_predictors, centers = 4)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(boston_lda$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% boston_lda$scaling
matrix_product <- as.data.frame(matrix_product)

lda1 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, 
        color = boston_train$crime, type = 'scatter3d', mode='markers', 
        marker=list(size = 3)) %>% 
  layout(title = "Crime classes", scene = list(xaxis = list(title = 'LD1'),
                     yaxis = list(title = 'LD2'),
                     zaxis = list(title = 'LD3')))

lda2 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        color = factor(boston_train_km4$cluster), type = 'scatter3d', mode='markers',
        marker=list(size = 3)) %>%
  layout(title = "K-means clusters", scene = list(xaxis = list(title = 'LD1'),
                     yaxis = list(title = 'LD2'),
                     zaxis = list(title = 'LD3')))

[1] https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html